All Questions

0 votes, 2 answers, 71 views

How to manage large datasets (approx 95GB)

I was planning some data analysis on a dataset I'll be using for some projects. The dataset in question is ZINC20. Now, I don't need the whole thing so I was going to write some functions that would ...
apvn

2 votes, 1 answer, 737 views

The single CSV created by combining a large number of CSV files is too large to process. What options do I have?

The dataset I am currently working on has more than 100 CSV files, each larger than 250 MB. These files contain time series data captured from different locations and all the files ...
odd_wolf

2 votes, 1 answer, 65 views

Best way to preprocess data

I need to create a machine learning model to predict whether a structure is a hotel or an apartment. I have a dataset structured as follows: ...
Fabio

1 vote, 1 answer, 290 views

How do I create a dataset from many CSV files that is too large for RAM

I have been handed about 40 GB of CSV files that I need to turn into a database. The files are arranged in a file structure that uses location in that file structure to create a relationship between ...
Finncent Price

0 votes, 1 answer, 519 views

How to do EDA on large datasets

I have a table in Postgres with ~5 million records. When I load the dataset using pandas to perform EDA, I run out of memory. ...
MikeB

1 vote, 1 answer, 43 views

Where can I find a dataset that contains criminal case sentencing data? [closed]

I would like to study a dataset where each record represents a criminal case in the US and contains attributes such as: type of crime, defendant age/sex/race, plea, verdict, and sentence. Is there a dataset ...
N00b101

2 votes, 1 answer, 121 views

What is the difference between Pachyderm and Git?

I learned that tools like Pachyderm version-control data, but I cannot see any difference between that tool and Git. I learned from this post that: It holds all your data in a central accessible ...
Lerner Zhang

2 votes, 1 answer, 223 views

Size of datasets over years

I am looking for statistics to understand how the size of (public) datasets has evolved over the years. So far I have found only the following statistics: the KDnuggets poll, which shows that over ...
asdf

3 votes, 1 answer, 3k views

How does skewed data affect deep neural networks?

I'm playing around with deep neural networks for a regression problem. The dataset I have is right-skewed, and for a linear regression model I would typically perform a log transform. Should I be ...
shaye059

1 vote, 1 answer, 344 views

Public dataset for news articles with their associated categories for multilabel data classification

I am wondering if there are any public datasets of news articles, from The New York Times (NYT) or similar sources, labeled with news categories such as politics, entertainment, lifestyle, general news, sports, etc. I ...
Anonymous

3 votes, 1 answer, 10k views

How to determine sample rate of a time series dataset?

I have a dataset of magnetometer sensor readings which looks like: ...
harry r

1 vote, 0 answers, 34 views

What is important for Pharmaceutical companies to answer with Big Data Analysis?

I am a data scientist with some background in biology (genetics). I have been asked to give a talk for our customers from the pharmaceutical industry. I should show them how they benefit from Big ...
Rebecca

0 votes, 1 answer, 25 views

Suggestion of dataset

I am implementing my own deep network, but I am not very good at calculus, so my network only works for binary data at the moment. I have been searching for big tabular datasets that are for binary ...
panchester

1 vote, 1 answer, 2k views

How to compute modulo of a hash?

Let's say that I have a set of users in my database, that have GUIDs as their IDs. I use xxhash to generate fixed-length hashes for each value, so that I can then ...
Den

2 votes, 0 answers, 69 views

How to deal with large datasets? [closed]

I have some experience with data science, but I wanted some insight into how to deal with a very large dataset. I understand that simply downloading it to your computer is not feasible, so where do you even ...
jeff
